# Multimodal Embedding

## UniME Phi3.5-V 4.2B
- Organization: DeepGlint-AI · License: MIT
- Tags: Multimodal Alignment · Transformers · English
- Downloads: 54 · Likes: 4

UniME is a general-purpose embedding model built on a multimodal large language model, focused on breaking down modality barriers to enable cross-modal retrieval and embedding learning.

## OmniEmbed v0.1
- Organization: Tevatron · License: MIT
- Tags: Multimodal Fusion
- Downloads: 2,190 · Likes: 3

A multimodal embedding model based on Qwen2.5-Omni-7B, producing unified embedding representations for cross-lingual text, images, audio, and video.

## Nomic Embed Multimodal 3B
- Organization: nomic-ai
- Tags: Text-to-Image · Supports Multiple Languages
- Downloads: 3,431 · Likes: 11

Nomic Embed Multimodal 3B is a state-of-the-art multimodal embedding model focused on visual document retrieval. It encodes text and images into a unified space and reaches 58.8 NDCG@5 on the Vidore-v2 benchmark.

## ColNomic Embed Multimodal 3B
- Organization: nomic-ai
- Tags: Multimodal Fusion · Supports Multiple Languages
- Downloads: 4,636 · Likes: 17

ColNomic Embed Multimodal 3B is a 3-billion-parameter multimodal embedding model designed specifically for visual document retrieval, supporting unified encoding of multilingual text and images.

## FinSeer
- Organization: TheFinAI
- Tags: Large Language Model · Transformers · English
- Downloads: 13 · Likes: 1

The first retriever designed specifically for financial time-series forecasting, built on the retrieval-augmented generation (RAG) framework.

## NitiBench CCL Human-Finetuned BGE-M3
- Organization: VISAI-AI · License: MIT
- Tags: Text Embedding · Other
- Downloads: 51 · Likes: 1

A version of BAAI/bge-m3 fine-tuned on Thai legal query data, supporting dense retrieval, lexical matching, and multi-vector interaction.

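Because this checkpoint is a fine-tune of BAAI/bge-m3, all three retrieval modes can be exercised through FlagEmbedding's `BGEM3FlagModel`. A minimal sketch, using the base model id from the description (swap in the fine-tuned checkpoint id from this card's repository); the query and document strings are invented for illustration:

```python
# Sketch: exercising all three BGE-M3 retrieval modes with FlagEmbedding.
# "BAAI/bge-m3" is the base model cited in the description; the example
# strings below are placeholders.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

queries = ["What does Thai law say about contract termination?"]
docs = ["A contract may be terminated by notice where the agreement so provides..."]

q = model.encode(queries, return_dense=True, return_sparse=True, return_colbert_vecs=True)
d = model.encode(docs, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Dense retrieval: similarity of pooled sentence vectors.
dense_score = q["dense_vecs"] @ d["dense_vecs"].T

# Lexical matching: overlap of learned per-token weights.
lexical_score = model.compute_lexical_matching_score(q["lexical_weights"][0], d["lexical_weights"][0])

# Multi-vector (ColBERT-style) late interaction.
colbert_score = model.colbert_score(q["colbert_vecs"][0], d["colbert_vecs"][0])

print(dense_score, lexical_score, colbert_score)
```
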
## LLaVE 7B
- Organization: zhibinlan · License: Apache-2.0
- Tags: Multimodal Fusion · Transformers · English
- Downloads: 1,389 · Likes: 5

LLaVE-7B is a 7-billion-parameter multimodal embedding model based on LLaVA-OneVision-7B that produces embedding representations for text, images, multiple images, and videos.

## LLaVE 2B
- Organization: zhibinlan · License: Apache-2.0
- Tags: Text-to-Image · Transformers · English
- Downloads: 20.05k · Likes: 45

LLaVE-2B is a 2-billion-parameter multimodal embedding model based on Aquila-VL-2B, featuring a 4K-token context window and supporting embeddings for text, images, multiple images, and videos.

## LLaVE 0.5B
- Organization: zhibinlan · License: Apache-2.0
- Tags: Multimodal Fusion · Transformers · English
- Downloads: 2,897 · Likes: 7

LLaVE-0.5B is a 0.5-billion-parameter multimodal embedding model based on LLaVA-OneVision-0.5B that embeds text, images, multiple images, and videos.

## ViT Base Patch16 SigLIP 512 (webli)
- Organization: timm · License: Apache-2.0
- Tags: Image Classification · Transformers
- Downloads: 702 · Likes: 0

A Vision Transformer following the SigLIP recipe, containing only the image encoder and using the original attention-pooling head.

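Since the checkpoint contains only the image tower, it is typically used as a feature extractor through timm. A minimal sketch, assuming the timm identifier `vit_base_patch16_siglip_512.webli` (matching this card's name) and a placeholder image file:

```python
# Sketch: pooled image embeddings from the SigLIP image tower via timm.
# Assumptions: the timm identifier "vit_base_patch16_siglip_512.webli" and a
# local placeholder image "photo.jpg".
import timm
import torch
from PIL import Image

model = timm.create_model("vit_base_patch16_siglip_512.webli", pretrained=True, num_classes=0)
model.eval()

cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

image = Image.open("photo.jpg").convert("RGB")
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # (1, embed_dim) pooled features

print(features.shape)
```
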
## DSE Qwen2 2B MRL V1
- Organization: MrLight · License: Apache-2.0
- Tags: Multimodal Fusion · Supports Multiple Languages
- Downloads: 4,447 · Likes: 56

DSE-QWen2-2b-MRL-V1 is a dual-encoder model designed to encode document screenshots into dense vectors for document retrieval.

## BGE-M3 GGUF
- Organization: lm-kit · License: MIT
- Tags: Text Embedding
- Downloads: 2,885 · Likes: 10

A GGUF-quantized build of the bge-m3 embedding model for efficient text embedding.

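A minimal sketch of computing embeddings from such a GGUF file with llama-cpp-python, one common runtime for GGUF models; the quantization file name below is a placeholder for whichever variant is downloaded:

```python
# Sketch: text embeddings from a bge-m3 GGUF file using llama-cpp-python.
# "bge-m3-Q4_K_M.gguf" is a placeholder file name.
from llama_cpp import Llama

model = Llama(model_path="bge-m3-Q4_K_M.gguf", embedding=True, n_ctx=8192)

vector = model.embed("A quick test sentence for the embedding model.")
print(len(vector))  # dense dimensionality (1024 for bge-m3)
```
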
## Nomic Embed Vision v1.5
- Organization: nomic-ai · License: Apache-2.0
- Tags: Text-to-Image · Transformers · English
- Downloads: 27.85k · Likes: 161

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1.5, enabling multimodal (text-to-image) applications.

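Because the vision and text models share one embedding space, text-to-image retrieval reduces to a dot product between the two encoders' outputs. A minimal sketch, assuming the usage documented on the Nomic model cards (CLS pooling on the vision side, a `search_query:` prefix on the text side) and a placeholder image path:

```python
# Sketch: text-to-image retrieval across the shared nomic embedding space.
# Assumptions: CLS-token pooling for the vision encoder and the "search_query: "
# prefix for text queries, per the Nomic model cards; "photo.jpg" is a placeholder.
import torch
import torch.nn.functional as F
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)

image = Image.open("photo.jpg")
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    img_emb = vision_model(**inputs).last_hidden_state
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)  # CLS vector, L2-normalized

text_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_embeddings = text_model.encode(
    ["search_query: a photo of a dog"],
    convert_to_tensor=True,
    normalize_embeddings=True,
)

# Cosine similarity between the query and the image (both unit-normalized).
print(text_embeddings @ img_embeddings.T)
```
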
## Nomic Embed Vision v1
- Organization: nomic-ai · License: Apache-2.0
- Tags: Text-to-Image · Transformers · English
- Downloads: 2,032 · Likes: 22

A high-performance vision embedding model that shares its embedding space with nomic-embed-text-v1, enabling multimodal applications.

## BGE-M3 ONNX
- Organization: aapot · License: MIT
- Tags: Text Embedding · Transformers
- Downloads: 292 · Likes: 29

BGE-M3 is an embedding model that supports dense retrieval, lexical matching, and multi-vector interaction, here converted to ONNX format for use with runtimes such as ONNX Runtime.

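A rough sketch of running such an export directly with ONNX Runtime. The local file name and the output layout (token-level hidden states, with the dense embedding taken as the normalized CLS vector) are assumptions; the actual export may expose differently named or pre-pooled outputs, so check the repository's usage notes.

```python
# Sketch: dense embeddings from a local BGE-M3 ONNX export via onnxruntime.
# Assumptions: the export is saved as "bge-m3-onnx/model.onnx" and its first
# output holds token-level hidden states of shape (batch, seq_len, dim).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
session = ort.InferenceSession("bge-m3-onnx/model.onnx", providers=["CPUExecutionProvider"])

enc = tokenizer(["What is BGE-M3?"], padding=True, return_tensors="np")
input_names = {i.name for i in session.get_inputs()}
outputs = session.run(None, {k: v for k, v in enc.items() if k in input_names})

hidden = outputs[0]                                  # assumed (batch, seq_len, dim)
dense = hidden[:, 0]                                 # CLS pooling, as in the PyTorch model
dense = dense / np.linalg.norm(dense, axis=1, keepdims=True)
print(dense.shape)
```
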
## SigLIP Base Patch16 224
- Organization: Xenova
- Tags: Text-to-Image · Transformers
- Downloads: 182 · Likes: 1

SigLIP is a vision-language pre-trained model suited to zero-shot image classification.

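For zero-shot classification in Python, the simplest route is the transformers pipeline against the upstream PyTorch checkpoint (the Xenova repository is a converted copy of the same weights for in-browser transformers.js use). A minimal sketch with a placeholder image path:

```python
# Sketch: zero-shot image classification with a SigLIP checkpoint.
# "google/siglip-base-patch16-224" is the upstream PyTorch model; "photo.jpg"
# is a placeholder image path (a URL also works).
from transformers import pipeline

classifier = pipeline("zero-shot-image-classification", model="google/siglip-base-patch16-224")
print(classifier("photo.jpg", candidate_labels=["a cat", "a dog", "a car"]))
```
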
## CLIP ViT Base Patch16
- Organization: Xenova
- Tags: Text-to-Image · Transformers
- Downloads: 32.99k · Likes: 9

OpenAI's open-source CLIP model with a Vision Transformer image encoder, supporting cross-modal understanding of images and text.

## Chinese CLIP ViT Base Patch16
- Organization: OFA-Sys
- Tags: Text-to-Image · Transformers
- Downloads: 49.02k · Likes: 104

The base version of Chinese CLIP, using ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder, trained on roughly 200 million Chinese image-text pairs.

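transformers ships dedicated Chinese-CLIP classes, so image-text matching follows the usual CLIP pattern. A minimal sketch with a placeholder image and candidate captions:

```python
# Sketch: Chinese image-text matching with the base Chinese-CLIP checkpoint.
# "cat.jpg" and the candidate captions are placeholders.
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("cat.jpg")
texts = ["一只猫", "一只狗", "一辆汽车"]  # "a cat", "a dog", "a car"

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, softmaxed into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```
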